The results of this study were significant in several respects.
First, the largest data base in existence of extemporaneous human
speech recorded under controlled conditions was developed for the
project. Second, a real-time processing capability was developed for
processing this data base of 40 hours of continuous speech. Finally,
with the restriction of clean speech (not processed over a telephone
or with noise), text-independent speaker recognition results nearly
matching those for text-dependent studies were achieved by averaging
over 30-40 second speech segments.
J. D. Markel, Beatrice T. Oshika, and A. H. Gray, Jr.,
"Long-Term Feature Averaging for Speaker Recognition,"
IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-25, no. 4, August 1977.

J. D. Markel and S. B. Davis, "Text-Independent Speaker
Recognition from a Large Linguistically Unconstrained
Time-Spaced Data Base," to be published in IEEE Trans.
Long-Term Feature Averaging for
Speaker Recognition

JOHN D. MARKEL, MEMBER, IEEE, BEATRICE T. OSHIKA, AND AUGUSTINE H. GRAY, JR., MEMBER, IEEE


Abstract-The potential benefits of long-term parameter averaging
for speaker recognition were investigated. Parameters studied were
pitch, gain, and reflection coefficients. Parameter variability was
computed over various averaging lengths from one frame averaging (in
effect, no averaging) to 1000 frame averaging (about 70 s of speech). It
was demonstrated that the between-to-within speaker variance ratio,
measured over several speakers, was significantly increased by
performing long-term averaging of the parameter sets. The reflection
coefficient averages for $k_2$ and $k_6$, respectively, were shown to
produce the highest variance ratios.


I.  Introduction 

There have been several studies on the choice of acoustic
features in speaker recognition tasks [14], [19], [22].
Average fundamental frequency has been found to be a useful
discriminating feature [13], as have gain measurements [2],
[10] and long-term speech spectra [4]-[6], [9]. Perceptual
studies indicate that "there is at least weak evidence that a
voice that is distinctive to listen to also has distinctive
spectrographic patterns" [20], and that dimensions of "characteristic
pitch" and "characteristic loudness" may be posited to differentiate
among speakers [21]. These speaker characteristics
can be distinguished from the acoustic cues which signal
linguistic elements, e.g., phonemes or words. For example, the
realization of the word "hit" by a female child is acoustically
very different from the same word pronounced by an adult
male, yet the words are generally understood to be equivalent
while the speakers are clearly different. It appears, then, that
listeners adapt to speakers' voice characteristics (as well as
their linguistic characteristics).

Manuscript received May 13, 1976; revised December 13, 1976 and
March 15, 1977. This research was supported by the Advanced Research
Projects Agency of the Department of Defense and was monitored
by the Office of Naval Research under Contract N00014-73-C-0221.

J. D. Markel and B. T. Oshika are with the Speech Communications
Research Laboratory, Inc., Santa Barbara, CA 93109.

A. H. Gray, Jr., is with the Speech Communications Research
Laboratory, Inc., Santa Barbara, CA 93109 and the Department of Electrical
Engineering and Computer Science, University of California, Santa
Barbara, CA 93109.

All this suggests that there are long-term characteristics
which can be used in text-independent speaker recognition
tasks. Such characteristics include long-term averages related
to fundamental frequency, gain, and spectral averages.

The motivation for long-term averaging in text-independent
speaker recognition is based upon a result from statistical
sampling theory.

We assume that $\{p(i)\}$ defines statistically independent,
identically distributed samples of the parameter $p$ with true
mean $\mu_p$ and variance $\sigma_p^2$. (For example, $p(i)$ may correspond
to the reflection coefficient $k_i$ samples for each analysis
frame.) If $x = \langle p(i) \rangle$ defines a feature based upon long-term
averaging of $p$, where

$$\langle p(i) \rangle = \frac{1}{L_v} \sum_{i=0}^{L_v - 1} p(i) \qquad (1)$$

and $L_v$ is the number of voiced analysis frames used in the
averaging, then the variance of $x$ is given in terms of the original
parameter variance by

$$\sigma^2[\langle p(i) \rangle] = \sigma_p^2 / L_v. \qquad (2)$$
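
The step from (1) to (2) can be written out explicitly. The short derivation below is a sketch that relies only on the stated assumption of independent, identically distributed samples; the $\operatorname{Var}$ notation is introduced here for illustration and is not the paper's.

```latex
\begin{align*}
\operatorname{Var}\!\left[\langle p(i)\rangle\right]
  &= \operatorname{Var}\!\left[\frac{1}{L_v}\sum_{i=0}^{L_v-1} p(i)\right] \\
  &= \frac{1}{L_v^{2}}\sum_{i=0}^{L_v-1}\operatorname{Var}\!\left[p(i)\right]
     && \text{(independence: all cross-covariances vanish)} \\
  &= \frac{L_v\,\sigma_p^{2}}{L_v^{2}}
   = \frac{\sigma_p^{2}}{L_v}.
\end{align*}
```

Under these assumptions the standard deviation of the averaged feature falls as $L_v^{-1/2}$, which is the reference slope for any log-log plot of feature spread against averaging length.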

The sample variance as a function of $L_v$ is an important figure
of merit for a particular feature. For example, if the features
are more tightly clustered together about the sample mean as
$L_v$ increases from $L_v = 1$ (no averaging), then intraspeaker
variability is decreased and the parameters would be expected
to result in higher performance in text-independent speaker
recognition tasks. Although no "true mean" or "true variance"
exists for real speech because of physiological variations in
human speech, it is reasonable to assume that at least some
convergence or clustering of parameters will occur with
long-term averaging.

The purpose of this paper is to define several sets of
potentially useful long-term features and then to investigate their
statistical properties as a function of the averaging length $L_v$.
In addition, discrimination tests are presented over a small
homogeneous set of speakers to illustrate the potential benefits
of long-term averaging for unconstrained text-independent
speaker recognition.

II. Features

To discuss the applicability of long-term feature averaging in
a quantitative manner, we have chosen three different feature
sets as the basis for analysis. Some of these features reflect
physiological characteristics more closely than others.

A. Fundamental Frequency Features

Due to physiological considerations such as the length and
thickness of the vocal folds, and respiratory muscle patterns,
the phonation of a particular vowel with "normal effort" may
result in differing rates of vocal fold vibration (corresponding
to the acoustical correlate of fundamental frequency) for
different speakers. For example, a child will have a high
fundamental frequency compared to an adult because of the child's
smaller vocal folds.

Although fundamental frequency, along with intensity and
duration, is a controllable attribute of stress and intonation
which may vary widely, each person appears to have a mean
fundamental frequency value which, if averaged over a
sufficiently long period of time, is relatively constant over a
reasonable time span and is independent of linguistic content [8].

In addition, the standard deviation of the fundamental
frequency over a long interval of time may carry important
speaker-dependent information. For example, if the speaker
is judged to be a monotone speaker, then the standard
deviation would be expected to be relatively small. However, if the
speaker is thought to be an "expressive" or "forceful" speaker,
it would be expected to be relatively large.

B.  A  Gain  Feature 

It seems reasonable to assume that one of the characteristics
that contributes to a speaker's identity is the amount of
intensity or gain variation in his speech over time. Subjectively, the
amount of gain variation is possibly correlated with the
perception of "dynamic" versus "flat" voices. The actual gain
variation is also a function of phonetic content, word and
phrase stress, and discourse context. For example, for a
constant subglottal pressure, the acoustical output energy for an
/a/ is about 5 dB greater than for a /u/. Also, a larger gain
variation would be expected with an exclamatory as opposed
to a normal declarative sentence. Our assumption is that, over
a sufficiently long interval of speech, gain variation can be
considered part of the individual speaker's characteristics.
That is, a speaker who is judged overall to be an "emphatic"
speaker will have larger gain variation than one who is judged
to have a usually monotonous voice.

In the measurement of gain variation, it is very important
that results be only a function of speaker characteristics and
not absolute system gain. Furthermore, because of the
distinctly different production mechanism between voiced and
unvoiced speech, it is desirable to measure the gain variations
only during voiced speech. A normalized gain variation which
satisfies desired physical properties is now defined. If $R(n)$
defines the energy of $N$ speech samples $\{s_n(i)\}$ in frame $n$, then

$$R(n) = \sum_{i=0}^{N-1} s_n^2(i). \qquad (3)$$

The sample mean and sample variance of $R(n)$ over $L_v$ voiced
frames are then defined by

$$\bar{R} = \langle R(n) \rangle \qquad (4)$$

and

$$\sigma_R^2 = \langle (R(n) - \bar{R})^2 \rangle \qquad (5)$$

where $\langle \cdot \rangle$ will be used throughout to denote averaging over $L_v$
voiced frames. The normalized gain variation $\Delta$ is then defined
by

$$\Delta = \sigma_R / \bar{R}. \qquad (6)$$

If the overall system gain is changed by a constant value, $\Delta$ is
unaffected. Furthermore, $\Delta$ is nonnegative with $\Delta = 0$ only
when $\sigma_R = 0$. Physically, $\Delta = 0$ means that the speech
envelope (more precisely, the frame energy) is unchanged over the
complete range of voiced speech analyzed.
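
The gain-invariance property of the normalized gain variation is easy to check numerically. The sketch below follows definitions (3)-(6); the function names and the synthetic "speech" frames are illustrative assumptions, not part of the original processing system.

```python
import numpy as np

def frame_energies(frames):
    """R(n) = sum_i s_n(i)^2 for each row (frame) of a 2-D array, as in (3)."""
    frames = np.asarray(frames, dtype=float)
    return np.sum(frames ** 2, axis=1)

def normalized_gain_variation(energies):
    """Delta = sigma_R / R_bar over the supplied frame energies, as in (4)-(6)."""
    energies = np.asarray(energies, dtype=float)
    r_bar = energies.mean()
    sigma_r = np.sqrt(np.mean((energies - r_bar) ** 2))
    return sigma_r / r_bar

rng = np.random.default_rng(0)
# Hypothetical stand-in for voiced speech: 200 frames of 256 samples,
# each frame scaled by a different level so that the energy varies.
levels = rng.uniform(0.5, 2.0, size=(200, 1))
frames = levels * rng.standard_normal((200, 256))

delta = normalized_gain_variation(frame_energies(frames))
delta_scaled = normalized_gain_variation(frame_energies(3.0 * frames))
print(delta, delta_scaled)  # the two values agree up to rounding
```

Scaling the signal by a constant multiplies both $\sigma_R$ and $\bar{R}$ by the same factor, so the ratio is unchanged; a constant-energy input gives exactly zero.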

C. Spectral Features

It is well established in the literature that one of the
acoustical features that tends to differentiate one particular speaker
from another during voiced speech production is the glottal
sound source shape [15].

Although the spectral slope of a single glottal pulse can vary
over a wide range from nearly whispered speech to very intense
vocal effort, for normal conversational speech it is expected
that an average glottal source spectrum could be obtained over
a relatively long interval of speech that would have relatively
small intraspeaker variability.

Unfortunately, glottal volume velocity waveform estimation
from speech is a nontrivial task [7], [12], [16]. A more
direct method for automatic real-time analysis is to use a
parameter set that is related to the smooth characteristics of the
spectrum, which is independent of fundamental frequency or
gain. With linear prediction analysis, obvious possibilities are
filter coefficients, reflection coefficients, or log area functions.
Sambur [17] compared these coefficients in a speech
recognition experiment and decided to make use of the reflection
coefficients. Although reflection coefficients are nonlinearly
related to the more physically meaningful smooth-spectral and
log spectral model from linear prediction analysis, there is
ample evidence that they do contain important
speaker-dependent information that is not contained in fundamental
frequency- or gain-related parameters. For example, in the
case of a first-order filter, $M = 1$, a smooth spectral model can
be physically and mathematically related to the first reflection
coefficient. This model [11, p. 139] has a spectral flatness
given by

$$\Xi = 1 - k_1^2. \qquad (7)$$

If the speech sample being analyzed has a nearly flat spectral
trend, $k_1$ approaches zero and the spectral flatness approaches
unity. As the spectral slope increases negatively, $k_1$ approaches
$-1$ and the spectral flatness approaches zero.
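
Relation (7) can be verified numerically. The sketch below assumes the first-order inverse filter has the form $A(z) = 1 + k_1 z^{-1}$ (a sign convention chosen here for illustration; the result depends only on $k_1^2$) and takes spectral flatness to be the ratio of the geometric to the arithmetic mean of the model power spectrum $1/|A|^2$. The function name is hypothetical.

```python
import numpy as np

def first_order_flatness(k1, n_points=4096):
    """Numerical spectral flatness of 1/|A|^2 with A(z) = 1 + k1 z^-1."""
    omega = 2.0 * np.pi * np.arange(n_points) / n_points
    power = 1.0 / np.abs(1.0 + k1 * np.exp(-1j * omega)) ** 2
    geometric_mean = np.exp(np.mean(np.log(power)))
    arithmetic_mean = np.mean(power)
    return geometric_mean / arithmetic_mean

# Compare the numerical flatness with the closed form 1 - k1^2.
for k1 in (0.0, -0.5, -0.9):
    print(k1, first_order_flatness(k1), 1.0 - k1 ** 2)
```

For $k_1 = 0$ the spectrum is flat and the ratio is one; as $|k_1|$ approaches one the spectrum becomes strongly tilted and the ratio approaches zero, as described in the text.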

Based upon the spectral matching properties of linear
prediction [11, p. 134], we would assume that preemphasis of the
data would be beneficial since the reflection coefficients
would then carry more information about the spectral
structure at higher frequencies.

It would also seem reasonable that if long-term, spectrally
related features are desired which minimize intravariability,
only voiced speech should be analyzed. Substantial differences
exist in the physiological mechanisms which produce voiced
and unvoiced sounds. Since the excitation for unvoiced speech
is generally assumed to have a flat spectrum, the difference in
spectral slope between voiced and unvoiced sounds may be on
the order of 8-16 dB. With only voiced sounds, some variation
will still occur since different articulator positions will cause
variations in the acoustic loading at the glottis, affecting the
glottal source shape. This variation, however, is expected to
be substantially less than that due to glottal source variations
in voiced-unvoiced speech production.

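D. Summary of Feature Definitions

As features we study the following.

1) $F_0$ average

$$x_1 = \bar{F}_0 = \langle F_0(n) \rangle. \qquad (8)$$

2) Standard deviation of $F_0$

$$x_2 = \sigma_{F_0} = \langle [F_0(n) - \bar{F}_0]^2 \rangle^{1/2}. \qquad (9)$$

3) Sample gain variation

$$x_3 = \sigma_R / \bar{R} \qquad (10)$$

where

$$\bar{R} = \langle R(n) \rangle \qquad (11)$$

and

$$\sigma_R = \langle [R(n) - \bar{R}]^2 \rangle^{1/2}. \qquad (12)$$

4) Spectrally related features (reflection coefficient averages)

$$x_{3+i} = \langle k_i(n) \rangle \quad \text{for } i = 1, 2, \ldots, M. \qquad (13)$$

The feature vector $x$ is defined by

$$x^T = [x_1 \; x_2 \; \cdots \; x_{3+M}] \qquad (14)$$

where $T$ denotes transpose.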
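
The feature definitions (8)-(14) translate directly into a few array operations. The following is a minimal sketch, assuming the per-frame fundamental frequency, frame energy, and reflection coefficients for the voiced frames are already available as arrays; the function name and the random inputs are hypothetical.

```python
import numpy as np

def long_term_features(f0, energy, refl):
    """Build x^T = [x_1 ... x_{3+M}] from per-frame values of voiced frames.

    f0:     shape (L_v,)    fundamental frequency per voiced frame, Hz
    energy: shape (L_v,)    frame energy R(n)
    refl:   shape (L_v, M)  reflection coefficients k_1..k_M per frame
    """
    f0 = np.asarray(f0, dtype=float)
    energy = np.asarray(energy, dtype=float)
    refl = np.asarray(refl, dtype=float)

    x1 = f0.mean()                                      # (8)  F0 average
    x2 = np.sqrt(np.mean((f0 - x1) ** 2))               # (9)  F0 standard deviation
    r_bar = energy.mean()                               # (11)
    sigma_r = np.sqrt(np.mean((energy - r_bar) ** 2))   # (12)
    x3 = sigma_r / r_bar                                # (10) sample gain variation
    k_avg = refl.mean(axis=0)                           # (13) reflection coefficient averages
    return np.concatenate(([x1, x2, x3], k_avg))        # (14)

rng = np.random.default_rng(1)
L_v, M = 100, 10
x = long_term_features(
    f0=rng.normal(120.0, 10.0, L_v),         # made-up F0 track
    energy=rng.uniform(1.0, 4.0, L_v),       # made-up frame energies
    refl=rng.uniform(-0.9, 0.9, (L_v, M)),   # made-up reflection coefficients
)
print(x.shape)  # (13,)
```

With $M = 10$ the vector has thirteen components, so the fifth component is $\langle k_2 \rangle$ and the ninth is $\langle k_6 \rangle$.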
III. Procedures

A. Data

The data used for the analysis were obtained during
interviews of four speakers. Each interview was then edited so that
only the interviewees' voices remained. The total duration of
each edited interview (including pauses) was typically 15-18 min.
The total data base used for this study was approximately
one hour in duration. No special precautions or
recording conditions were imposed on the experiment.
Interviews were conducted in normal room environments with a
dynamic microphone and an audio tape recorder. So that a
small number of speakers could be used with some generality
in extrapolating results, a homogeneous population of four
male speakers was chosen, each having somewhat similar
speech characteristics and relatively narrow fundamental
frequency ranges. Histograms of the raw nonaveraged
fundamental frequency values showed substantial overlap among the
four speakers.

B. Digital Processing of Data

The audio tape was digitally processed using the system
shown in Fig. 1. Each test segment was recorded onto a disk
using conventional procedures. A novel part of the procedure
is based upon the use of a high-speed signal processor and
oscilloscope (for visual feedback during processing). Using an
array-processing software system, it is possible to process the
data in real time at a 50 Hz analysis frame rate from a Fortran
environment. Processing includes modified cepstral pitch
period and voicing detection, gain calculation, linear prediction
analysis for reflection coefficients, and a running mean and
mean-square computation of these parameters.

The procedure for generating output feature vectors to be
used in the statistical analysis is shown in Fig. 2. A counter
for frame $n$ is incremented and one frame of speech is
analyzed. The parameters used are sampling frequency $f_s =
6.5$ kHz, number of analysis coefficients $M = 10$, number of
samples for reflection coefficient computation = 128, and the
number of samples for $F_0$ and gain parameter analysis = 256
(40 ms). The analysis frame rate is 50 Hz. Preemphasis of the
speech data is applied using a differencer, $1 - z^{-1}$.
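
The numerical parameters above fix the frame layout: at 6.5 kHz sampling and a 50 Hz frame rate, successive frames start 130 samples apart, so 256-sample frames overlap. The sketch below applies the differencer and cuts a signal into frames; the helper names and the test tone are illustrative assumptions, and whether each analysis stage operated on preemphasized data is not specified here.

```python
import numpy as np

FS = 6500                  # sampling frequency, Hz (from the text)
FRAME_RATE = 50            # analysis frames per second (from the text)
HOP = FS // FRAME_RATE     # 130 samples between frame starts
N_GAIN = 256               # samples per frame for F0 and gain analysis (from the text)

def preemphasize(s):
    """Apply the differencer 1 - z^-1: y(n) = s(n) - s(n-1), with s(-1) taken as 0."""
    s = np.asarray(s, dtype=float)
    return np.diff(s, prepend=0.0)

def split_frames(s, length=N_GAIN, hop=HOP):
    """Return overlapping frames as rows; samples that do not fill a last frame are dropped."""
    s = np.asarray(s, dtype=float)
    if len(s) < length:
        return np.empty((0, length))
    n_frames = 1 + (len(s) - length) // hop
    return np.stack([s[i * hop : i * hop + length] for i in range(n_frames)])

# One second of a 120 Hz tone, used only to exercise the functions.
signal = np.sin(2 * np.pi * 120.0 * np.arange(FS) / FS)
frames = split_frames(preemphasize(signal))
print(HOP, frames.shape)
```

The 256-sample window spans 256 / 6500 s, or roughly 39 ms, which matches the quoted 40 ms.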

Fundamental frequency estimation is performed with a
modified cepstral technique. After the spectrum has been
computed, a symmetrical window function is applied that smoothly
tapers from unity at 1000 Hz to zero from 1500 Hz to $f_s/2$.
This simple modification resolves most of the voicing problems
one obtains with the usual cepstral analysis method since only
the most consistent region of harmonic structure is used [3].
Two frames of delay are included in the system so that some
amount of error detection and correction can be applied in the
pitch period estimation. One additional test has been found
necessary for obtaining meaningful feature vectors. A
max($F_0$) and min($F_0$) value are chosen for the speaker being
analyzed to prevent gross errors from dramatically affecting
the fundamental frequency features. If
min($F_0$) < $F_0$ < max($F_0$), the frame is judged to be voiced
and the long-term averages are updated. The frame counter $l$ is
incremented and if $l \ge L_v$, the resultant feature vector $x$ is
output to disk, $l$ is reset to zero, and analysis then continues.

Fig. 2. Procedure for generating output feature vectors.
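
The accumulate-and-output loop described above can be sketched as a small generator. This is a simplified illustration that tracks only the two $F_0$ features and uses only the min/max range test as the voicing decision; the cepstral voicing detector and the two-frame error correction are not modeled, and all names and values are hypothetical.

```python
import math

def feature_stream(f0_track, f0_min, f0_max, L_v):
    """Yield (mean F0, std F0) each time L_v accepted frames have been accumulated."""
    count = 0          # voiced-frame counter l
    total = 0.0        # running sum of F0
    total_sq = 0.0     # running sum of F0 squared
    for f0 in f0_track:
        if not (f0_min < f0 < f0_max):
            continue   # treated as unvoiced or as a gross error; averages untouched
        count += 1
        total += f0
        total_sq += f0 * f0
        if count >= L_v:
            mean = total / count
            var = max(total_sq / count - mean * mean, 0.0)
            yield mean, math.sqrt(var)
            count, total, total_sq = 0, 0.0, 0.0   # reset and continue the analysis

# 0 marks an unvoiced frame and 900 a gross pitch error (made-up values).
track = [0, 100, 110, 0, 900, 120, 130, 0, 105]
print(list(feature_stream(track, f0_min=60, f0_max=400, L_v=2)))
# [(105.0, 5.0), (125.0, 5.0)]
```

Keeping only a running sum and sum of squares means the features can be produced in a single pass without storing the frames, which is in the spirit of the running mean and mean-square computation mentioned earlier.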

IV.  Experiments 

A. Experiment 1: Statistical Variation as a Function of $L_v$

The complete edited audio tape for speaker D (approximately
18 min in duration) was analyzed to extract long-term
averaged feature vectors for several $L_v$ conditions. As a time
reference, $L_v = 1000$ corresponds to approximately 70 s. The
total number of vector samples obtained is approximately
inversely proportional to $L_v$.

The unbiased variance estimate of the feature $x = \langle p(i) \rangle$
based upon the speech parameter $p$ is

$$\sigma^2(x) = \frac{1}{L_f - 1} \sum_{j=1}^{L_f} [x(j) - \bar{x}]^2 \qquad (15)$$

where

$$\bar{x} = \frac{1}{L_f} \sum_{j=1}^{L_f} x(j). \qquad (16)$$

Each $x(j)$ explicitly denotes an individual feature, and $L_f$ is the
number of feature vectors obtained over the total speech
duration. Note that $L_f$ is actually a function of $L_v$ since the total
duration is fixed. The sample mean $\bar{x}$ is thus independent of
$L_v$ except for sampling variation in the real-time analysis
because it is not possible to start analysis at precisely the same
location on the audio tape when $L_v$ is changed. The true
variance $\sigma_p^2$ is estimated from $\sigma_p^2 = \sigma^2(x)$ with $L_v = 1$. Features
which themselves are based on variances (such as $x_2 = \sigma_{F_0}$ and
$x_3 = \sigma_R / \bar{R}$) do not allow for a true variance estimate. The
sample standard deviations of the fundamental
frequency-related features are shown in Fig. 3 as a function of $L_v$. The
estimated standard deviation about the long-term fundamental
frequency averages is reduced from about 16 Hz for $L_v = 10$ to
about 6 Hz for $L_v = 1000$. These values are somewhat higher
than the long-term $F_0$ averages reported by Horii [6].
However, this experiment is based upon unconstrained
conversational speech, whereas Horii's experiment was based upon a

Standard deviation of gain-related features as a function of the
number of voiced frames $L_v$.

IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-25, NO. 4, AUGUST 1977

Fig. 5. (a) Standard deviation of reflection coefficient averages as a
function of the number of voiced frames $L_v$: $\langle k_1 \rangle$, $\langle k_{10} \rangle$ deviations.
(b) Standard deviation of reflection coefficient averages as a function
of the number of voiced frames $L_v$: rms of all coefficient variances.

rapidly as predicted by sampling theory for the case of
independent samples because of intraspeaker variability, the
decrease is substantial and is surprisingly linear on a log-log
scale. Instead of a $L_v^{-1/2}$ relation, the standard deviation of
the reflection coefficient features appears to approximately
decrease proportionally to a $L_v^{-1/3}$ model beyond $L_v = 10$.
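
The $L_v^{-1/2}$ reference slope against which the measured $L_v^{-1/3}$ behavior is compared can be reproduced with synthetic data. The sketch below uses independent Gaussian samples as a stand-in for a per-frame parameter (an assumption made purely for illustration; real speech parameters are correlated from frame to frame, which is one reason a slower decrease is observed) and fits the slope of the standard deviation of the averages on a log-log scale.

```python
import numpy as np

def std_of_averages(p, L_v):
    """Sample standard deviation (unbiased variance) of non-overlapping L_v-frame averages."""
    n_vectors = len(p) // L_v                      # L_f shrinks as L_v grows
    x = p[: n_vectors * L_v].reshape(n_vectors, L_v).mean(axis=1)
    return x.std(ddof=1)

rng = np.random.default_rng(0)
p = rng.standard_normal(200_000)                   # synthetic i.i.d. parameter track
lengths = np.array([1, 10, 100, 1000])
stds = np.array([std_of_averages(p, L) for L in lengths])

# Slope of log(std) versus log(L_v); i.i.d. samples should give about -1/2.
slope = np.polyfit(np.log10(lengths), np.log10(stds), 1)[0]
print(stds, slope)
```

Because the total number of samples is fixed, the number of averaged vectors falls in inverse proportion to the averaging length, so the estimate at the largest length is the noisiest.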

The rms deviation over all $\langle k_i \rangle$ averages is shown in Fig. 5(b).
Over a range of $L_v$ from 10 to 1000, the $L_v^{-1/3}$ model is still
seen to be very accurate for predicting the decrease in
reflection coefficient feature parameter variation as $L_v$ is increased.
The measured exponent value is certainly dependent upon the
particular speaker. However, it appears to vary only slightly
from the model discussed for the several other speaker
measurements.

The estimate of the true variance for the $k_1$, $k_{10}$, and overall
parameter variance is also shown in Fig. 5(a) and (b) at $L_v = 1$.


A second way of qualitatively showing the effect of
long-term averaging is to show two-dimensional scatter diagrams for
various values of $L_v$. Fig. 6 shows a scatter plot of $\langle k_2 \rangle$ versus
$\langle k_1 \rangle$ samples. Each point is based upon $L_v$ samples from the
edited audio tape for speaker D. Shown with the data are
two-sigma ellipses with the principal axes. A dramatic decrease in
the dispersion of the data is seen as $L_v$ increases.

B. Experiment 2: Discrimination as a Function of $L_v$

The approach taken here is to investigate the effectiveness of
long-term averaging for speaker recognition using the ratio of
the between-speaker variance and the within-speaker variance,
without specifying particular speaker recognition experiments.
Since the mathematics of this procedure (Fisher discriminant
method) is discussed elsewhere [1], only the necessary details
will be summarized below.

A within-speaker covariance matrix $W$ is computed, and then
a normalized between-speaker covariance matrix $B'$ is found in
terms of the matrix $B$ of Bricker et al. [1] from

$$B' = B / L_f \qquad (17)$$

where $L_f$ is the number of feature vectors. The normalization
is included so that $B'$ will depend only upon the sample means,
not upon the number of feature vectors. Eigenvalues and
eigenvectors of the equation

$$B' \phi_k = \lambda_k W \phi_k \qquad (18)$$

are then obtained. The eigenvalues are ordered from highest
to lowest, and as the number of speakers, four, is less than the
number of features, thirteen, all but the first three eigenvalues
are zero [1]:

$$\lambda_1 > \lambda_2 > \lambda_3 > \lambda_4 = \lambda_5 = \cdots = \lambda_{13} = 0. \qquad (19)$$

A new coordinate system is defined using the eigenvectors of
(18) as base vectors, so that the new coordinate vector $y$ is
related to $x$ through the linear transformation

$$y = \Phi^T x \qquad (20)$$

where $\Phi$ is the matrix whose columns are the eigenvectors of
(18). The eigenvalues of (18) represent the variance ratios in
the directions of the eigenvectors with $\lambda_1$ being the maximum
variance ratio, $\lambda_2$ being the next largest (in a direction
orthogonal to $\phi_1$), etc. Variance ratios can also be computed in the
original coordinate system as a method for measuring relative
effectiveness of features.

Table of variance ratios for the transformed features $y_1, \ldots, y_{13}$ (columns: FEATURES; VARIANCE RATIO at $L_v = 100$ and at $L_v = 1000$). Surviving nonzero entries: 10.959, 0.972, 0.393; rows $y_4$ through $y_{13}$ are 0 in both columns.
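
The eigenvalue problem (18) and the transformation (20) can be sketched with standard linear algebra. The example below is an illustration on synthetic data, not the original analysis: the between-speaker matrix is taken as the covariance of the speaker means about the grand mean (an assumed form that, like $B'$, depends only on the sample means), the within-speaker matrix is the pooled covariance, and the function name is hypothetical.

```python
import numpy as np

def fisher_discriminant(X, labels):
    """Solve B' phi = lambda W phi for feature vectors X (rows) with speaker labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    grand_mean = X.mean(axis=0)

    d = X.shape[1]
    W = np.zeros((d, d))   # pooled within-speaker covariance
    B = np.zeros((d, d))   # covariance of the speaker means (assumed form of B')
    for s in speakers:
        Xs = X[labels == s]
        dev = Xs - Xs.mean(axis=0)
        W += dev.T @ dev
        m = (Xs.mean(axis=0) - grand_mean)[:, None]
        B += m @ m.T
    W /= len(X) - len(speakers)
    B /= len(speakers)

    # Reduce to a symmetric problem with the Cholesky factor W = C C^T.
    C = np.linalg.cholesky(W)
    S = np.linalg.solve(C, np.linalg.solve(C, B).T)    # C^-1 B C^-T
    lam, U = np.linalg.eigh((S + S.T) / 2.0)
    order = np.argsort(lam)[::-1]                      # highest variance ratio first
    lam = lam[order]
    Phi = np.linalg.solve(C.T, U[:, order])            # eigenvectors as columns
    return lam, Phi, B, W

rng = np.random.default_rng(0)
n_speakers, n_features, n_vectors = 4, 13, 50
means = rng.normal(0.0, 2.0, (n_speakers, n_features))   # synthetic speaker means
X = np.vstack([m + rng.standard_normal((n_vectors, n_features)) for m in means])
labels = np.repeat(np.arange(n_speakers), n_vectors)

lam, Phi, B, W = fisher_discriminant(X, labels)
Y = X @ Phi            # each row is y^T = x^T Phi, i.e. y = Phi^T x
print(np.round(lam[:4], 3))
```

With four speakers the between-speaker matrix has rank at most three, so only the first three variance ratios are nonzero (up to rounding), mirroring (19).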

Tables I and II show the variance ratios in the original and
transformed coordinate systems, respectively, for $L_v = 100$ and
$L_v = 1000$. Except for the fact that parameter correlation is
not taken into account, the variance ratio values can be taken
as quantitative measures of the original parameter's
effectiveness in speech recognition. For example, we see that $\sigma_{F_0}$
provides very little discrimination among speakers, whereas $\langle k_2 \rangle$
appears to provide the maximum discrimination among
speakers over all parameters. It is seen that the first dimension in
the new coordinate system results in a substantially increased
variance ratio.


Fig. 7. Scatter plots for speakers A-D along first three Fisher
discriminant dimensions ($L_v = 100$).

Fig. 8. Scatter plots for speakers A-D along first three Fisher
discriminant dimensions ($L_v = 1000$).


Two-dimensional scatter plots of the first three transformed
dimensions are shown in Fig. 7 for $L_v = 100$ and in Fig. 8 for
$L_v = 1000$. The results are based upon the four speakers A-D.
Also shown are two-sigma ellipses and the principal axes for
each speaker distribution. In Fig. 7 it is seen that A and D are
essentially uniquely separated from B and C in at least one
plane ($y_1$-$y_3$). A relatively large overlap does occur,
however, for B and C in all planes. A cursory comparison of Fig. 7
and the relative sizes of clusters in Fig. 6 will illustrate that
substantial benefits in discriminating against different speakers
have been obtained over using no averaging ($L_v = 1$) or very
limited amounts of averaging ($L_v = 10$).

In Fig. 8, it is seen that by performing long-term averaging
with $L_v = 1000$, perfect discrimination is obtained, in this
instance based upon only a two-dimensional transformed
feature representation.

The variance ratios for the input feature variables are shown
in Table I for $L_v = 100$ and $L_v = 1000$. If the variables were
statistically independent, these ratios would differ by a
multiplicative factor of 10 rather than the smaller factors indicated
in the figure. The ordering of the features in terms of variance
ratios is of some interest. The fifth feature, $\langle k_2 \rangle$, clearly shows
the largest variance ratio, with the ninth feature, $\langle k_6 \rangle$, the next
largest. These coefficients correspond to the coefficient for
the highest power of $z^{-1}$ in the models of order 2 and 6 found
from linear prediction analysis. The two-pole model has been
used in earlier recognition tasks [18].

The variance ratios for the first three features (fundamental
frequency, its standard deviation, and sample gain variation)
are smaller than what one might expect from intuition. Part
of the reason may lie in the fact that the speakers were chosen
to have similar fundamental frequency ranges.

The variance ratios for the new coordinate system, the
eigenvalues of (18), are shown in Table II. From these ratios and
the scatter diagrams of Fig. 8, it can be seen that very clear
separation of the speakers is indicated for the long-term
average case of $L_v = 1000$ by using only the first two coordinates,
$y_1$ and $y_2$, in the direction of the eigenvectors $\phi_1$ and $\phi_2$.

V. Discussion

A. Parameter Variability Over Days, Weeks, Etc.

This initial study has been restricted to the study of
long-term averages taken from one session. This is probably the
reason why the standard deviation of the long-term averages
tends to have a monotonically decreasing behavior. Although
some amount of intraspeaker variability is reflected in the data,
additional variability will occur when results are obtained from
sessions separated by days, weeks, or months. In several
studies over linguistically constrained units, this effect has
been shown to be severe beyond several months for short
text-dependent segments [4]. A large data base extending over
several months is now being generated for studying these
effects in conversational speech.

B. Accuracy of Voicing Decisions

Since all long-term statistics are made only during voicing, it
is very important to know that realistic voicing decisions are
made. Spectral slope and normalized gain variation are direct
computations requiring no decisions (except for voicing) and
are, therefore, very robust.

If the threshold setting for voicing and pitch period detection
is set too high or too low, the effect can be catastrophic. At
one extreme, if the voicing threshold is too high, very few
frames will be included in the statistics as being voiced
(although they will be very reliable estimates) and, furthermore,
transitions in which considerable fundamental frequency
variations may occur are likely to be missed, causing the measured
fundamental frequency standard deviation to be unrealistically
small.

At the other extreme, if the threshold is too low there will
be a tendency to define fundamental frequencies near the
maximum allowable frequency (minimum pitch period) (near
400 Hz) during actual voiced speech and at random values
throughout the rest of the allowable range during unvoiced
speech. Although a pitch period and voicing decision program
with several frames of delay is used for error detection and
correction, it is essentially impossible to separate accurate
estimates from gross errors beyond some reasonable threshold.

C. Assumptions Versus Experimental Results

It was assumed that $\langle F_0 \rangle$ carries important speaker
information. The $\sigma[\langle F_0 \rangle]$ versus $L_v$ graph in Fig. 3 showed a
significant monotone decrease as $L_v$ was increased. In addition, the
variance ratio was relatively high (even though speakers were
purposely chosen with similar fundamental frequency ranges).
Therefore this assumption appears valid. The assumption that
$\sigma_{F_0}$ is meaningful does not appear to be true for
conversational speech. The variance ratio for this feature is extremely
small. This result contradicts that shown by Mead [13], where
the use of the first through the fourth moments of $F_0$ and of
the first four differences of $F_0$ (resulting in 20 features) was
suggested. Our experience indicates that unless hand-marked
or hand-corrected $F_0$ contours are used, very significant biases
in results can occur because of very few gross errors in $F_0$
estimation. Higher order differences and moments only magnify
these biases.

The standard deviation of the gain deviation feature as a
function of Lv shows a weak relationship to expectations from
statistical sampling theory. In addition, the variance ratios for
the gain deviation feature are relatively small. Although some
discrimination is obtained, what we have seen is that not only
is there substantial intraspeaker variability for this parameter,
but that in addition, considerable overlap in the gain feature
values occurs between speakers. Other measures of fundamental
frequency and gain variations may prove to be more useful
than the ones used here, which are essentially based upon root
mean squares taken about the averages. One possibility is the
use of the ratio of geometric and arithmetic means as used in
evaluating spectral flatness [11].
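
As a rough illustration of the alternative measure mentioned above, the following sketch computes the ratio of the geometric mean to the arithmetic mean of a set of positive values. The function name `geo_arith_ratio` and the sample values are hypothetical; the study does not specify how such a measure would be applied to fundamental frequency or gain, so this is only one plausible reading.

```python
import math


def geo_arith_ratio(values):
    """Ratio of geometric mean to arithmetic mean for strictly positive values.

    The ratio equals 1.0 when all values are identical and falls toward 0
    as the values become more spread out (arithmetic-geometric mean inequality).
    """
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be a non-empty sequence of positive numbers")
    log_mean = sum(math.log(v) for v in values) / len(values)
    arith_mean = sum(values) / len(values)
    return math.exp(log_mean) / arith_mean


# Hypothetical F0 values in Hz, for illustration only (not data from the study).
steady = [120.0, 120.0, 120.0, 120.0]
varied = [90.0, 110.0, 150.0, 200.0]
print(geo_arith_ratio(steady))
print(geo_arith_ratio(varied))
```

Unlike a root-mean-square deviation about the average, this ratio is dimensionless and bounded between 0 and 1, which may make it less sensitive to a few gross estimation errors; that possibility is an assumption and is not tested in the study.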

The long-term averages of the reflection coefficients as a set
appear to be the most significant features for speaker
recognition. Not only does the standard deviation of the long-term
averages show a substantial decrease as a function of Lv, but
in addition, the variance ratios are seen to be relatively large
for most of the parameters.

D. Observations on Reflection Coefficient Averaging

Although σ[<k10>] < σ[<k1>] for all Lv in Fig. 5(a), one
should not be misled into thinking that <k10> is a better feature
for speaker recognition. This result occurs because k1
inherently has a larger standard deviation than k10 (<ki> = ki for
Lv = 1). The important fact to note is that whatever the
parameter deviation is without averaging, due to either linguistic
content or intraspeaker variability, it decreases as $L_v^{-\alpha}$, where
$\tfrac{1}{3} \le \alpha \le \tfrac{1}{2}$, when long-term averaging is applied.
In a recent paper [1■], the use of orthogonal linear prediction
parameters in text-independent speaker recognition
studies was suggested. Although very high recognition
scores were shown using the orthogonal linear prediction
parameters, we would suggest that a substantial reduction in score
would occur if unconstrained data bases as described here were
used. Whatever scores are obtained using, in effect, Lv = 1,
our results qualitatively indicate that substantial improvements
could occur by incorporating long-term averaging.
Each orthogonal parameter was obtained from a linear
combination of all reflection coefficients as

\[ \phi_i = \sum_{j=1} c_{ij} k_j \]

where the c_ij terms were obtained from a principal component
analysis. The averaged parameters would then be

\[ \langle \phi_i \rangle = \sum_{j=1} c_{ij} \langle k_j \rangle. \qquad (21) \]

Although Fig. 6 shows only the dispersion characteristics for
<k1> and one other averaged coefficient, similar characteristics
are obtained for all the coefficients. The amount of data
dispersion will be primarily due to the value of Lv, not the fact
that a linear combination of the kj terms (or the <kj> terms) has
been obtained.

E. Computational Considerations

Studies of this type place a premium on the available
processing speed of the computer system. It became clear early
in the study that small- or medium-scale computer capability
was insufficient. For example, the analysis method described
runs in approximately 100 times real time if all operations are
implemented only on the PDP-11 system. The relatively small
data base of speakers for this study would have required over
100 hours of processing time.

Except for the nontrivial costs in software development, we
have found that attaching a high-speed processor to the main
computer system provides a very economical solution to the
requirements for real-time processing.

VI. Summary

The properties of long-term feature averaging for three sets
of fundamental frequency related, gain related, and spectrally
related parameters have been investigated. Based upon the
Fisher discriminant method, the rank ordering of the parameter
sets in importance was shown to be spectral, fundamental
frequency, and then gain. It was also shown that over a long
duration from Lv = 10 to Lv = 1000, the standard deviation of
the sample means of the reflection coefficient vectors
decreased proportionally to $L_v^{-1/2}$.

A small number of speakers with relatively homogeneous
characteristics was used to illustrate the effects of long-term
averaging. The data base was of nontrivial duration, somewhat
greater than one hour in length. Furthermore, the text was
unconstrained conversational speech, recorded under normal
room noise conditions. Analysis was performed in real time
with a high-speed signal processor.

Presently, other spectral representation methods are being
investigated and a data base is being developed for performing
text-independent speaker recognition tests without any
linguistic or structural constraints.


Text-Independent Speaker Recognition
from a Large Linguistically Unconstrained
Time-Spaced Data Base


John D. Markel and Steven B. Davis


Abstract


A  very  large  data  base  consisting  of  over  thirty-six  hours  of 
unconstrained  extemporaneous  speech,  from  seventeen  speakers,  recorded  over 
a  period  of  more  than  three  months,  has  been  analyzed  to  determine  the 
effectiveness  of  long-term  average  features  for  speaker  recognition. 
Results  are  shown  to  be  strongly  dependent  on  the  voiced  speech  averaging 
interval Lv. Monotonic increases in the probability of correct
identification  and  monotonic  decreases  in  the  equal  error  probability  for 
speaker  verification  were  obtained  as  Lv  increased,  even  with  substantial 
time  periods  between  successive  sessions.  For  Lv  corresponding  to 
approximately  thirty-nine  seconds  of  speech,  text-independent  results  (no 
linguistic  constraints  embedded  into  the  data  base)  of  98.05%  for  speaker 
identification and 4.25% for equal error speaker verification were
obtained.


I. Introduction


In  recent  years,  there  has  been  an  increasing  interest  in  computer- 
based techniques for text-independent speaker recognition (1-6).
Recognition is used here to encompass both speaker identification and
verification (7). The term "text-independent" has been used in several
different  contexts.  For  example,  Atal  (1)  has  used  the  term  in  the  sense 
of  choosing  independent  randomized  test  frames  from  a  single  sentence  to 
use  against  the  remaining  frames  as  a  reference  set.  Sambur  (4)  has  used 
the  term  in  an  experiment  where  the  sentences  in  the  test  set  were 
different  from  those  in  the  reference  set,  even  though  each  speaker  read 
precisely  the  same  list  of  sentences. 

Although  useful  insight  has  been  gained  by  these  approaches,  they  were 
linguistically  constrained.  In  many  practical  situations,  where  text- 
independent  speaker  recognition  is  desired,  there  typically  will  be  no 
control over the speech being tested. As Beek, Neuberg and Hodge (8) have
pointed out, text-independent speaker identification can overcome problems
which may arise if the speaker is uncooperative, and there is great
interest in speaker identification over communications channels, which
have  no  linguistic  constraints.  Furthermore,  there  may  be  days  to  weeks  of 
separation  between  reference  and  test  sessions. 

Several  other  studies  (2, 3, 5, 6)  have  analyzed  data  with  varying 
amounts  of  linguistic  constraints.  Li  and  Walker  (2)  used  thirty  seconds 
of  speech  read  from  the  rainbow  passage  (9)  recorded  once  by  twenty-two 
male speakers and twice by an additional eight male speakers. They did not
specify the number of days separating the recordings. They demonstrated
that  distances  among  spectral  correlation  matrices  could  be  used  to  compare 
inter-speaker  and  intra-speaker  differences.  However,  the  same  text  was 
used  for  all  tests,  which  could  be  interpreted  as  a  linguistic  constraint. 

Hunt,  Yates  and  Bridle  (6)  used  approximately  six  two-  to  three-minute 
long  FM  radio  weather  forecasts  from  each  of  eleven  male  and  two  female 
speakers. Each forecast was divided into twenty- or thirty-second
intervals and long-term fundamental frequency and cepstral coefficient
features  were  computed  for  twenty-millisecond  sequential  frames  in  each 
interval.  They  did  not  specify  the  number  of  days  between  successive 
forecasts  by  the  same  speaker.  Using  Fisher  discriminant  analysis  (10), 
they  achieved  89%  correct  speaker  identification  with  independent  test  and 
reference  sets.  However,  the  speakers  read  text  with  some  effort  at 
uniformity  between  sessions,  which  could  also  be  interpreted  as  a 
linguistic  constraint. 

In  a  preliminary  study,  Markel,  Oshika  and  Gray  (5)  used  one 
fifteen-  to  eighteen-minute  interview  from  each  of  four  male  speakers  with 
somewhat  similar  speech  characteristics.  The  interviews  were  recorded  with 
an  audio  tape  recorder  in  a  normal  room  environment.  Long-term  fundamental 
frequency,  gain  and  reflection  coefficient  features  were  computed  for  every 
1000  sequential  voiced  frames  (twenty-millisecond  windows  per  frame,  fifty 
frames  per  second)  in  each  interview.  Using  the  same  Fisher  discriminant 
analysis (10) as Hunt et al. to transform the data, they achieved perfect
discrimination among the four speakers. These recorded interviews were
considered to be free of linguistic constraints. However, the data were
insufficient to obtain statistically significant results, and with only one
session per speaker, there was no analysis of speaker characteristics over
time.


The  purpose  of  this  paper  is  to  present  results  from  experiments  in 
speaker  recognition  where  there  were  no  linguistic  constraints  on  the 
speech  content  (other  than  the  ones  implied  when  the  speaker  is 
cooperative,  and  English  is  used).  In  comparison  with  the  previous  study 
(5),  results  are  presented  for  a  larger  number  of  speakers,  for  multiple 
sessions  from  each  speaker,  and  for  a  greater  number  of  features. 
Furthermore,  the  effects  of  time  between  recording  sessions  are  studied. 
For  practical  implementation,  only  parameters  obtained  from  the  analysis 
portion  of  a  linear  prediction  vocoder  (fundamental  frequency,  gain  and 
reflection coefficients) were used. (Beek et al. (8) have stated that the
reflection coefficients are currently favored for all-digital narrowband
communications  systems.)  This  study  shows  that  if  these  parameters  are 
averaged  over  sufficiently  long  intervals  of  time,  such  as  thirty  seconds 
or  more,  the  features  obtained  are  essentially  free  of  linguistic 
constraint,  and  speaker  recognition  performance  is  comparable  with  some 
text-dependent  speaker  recognition  experiments.  The  linguistic  results 
agree with Li and Walker (2), who used a smaller data base; long-term
speech features are relatively stable after thirty seconds. Furthermore,
this study shows that if the averaging interval is too short, speaker
recognition  performance  is  unacceptable  with  linguistically  unconstrained 
extemporaneous  speech.  In  addition,  the  importance  of  having  a  time-spaced 
reference set of sufficient size is demonstrated.

II. Data Base and Processing Methodology


A  data  base  was  collected  by  recording  170  fifteen-minute  interviews 
from  eleven  male  and  six  female  speakers.  There  were  ten  sessions  per 
speaker,  with  each  session  separated  by  a  minimum  of  one  week.  Generally, 
the  successive  sessions  were  obtained  within  two  to  three  weeks.  One 
exceptional  separation  between  successive  sessions  was  fourteen  weeks. 

All  sessions  were  recorded  on  a  Tandberg  9000X  two-track  tape  recorder 
at  a  recording  speed  of  7.5  ips.  One  track  was  used  to  record  the 
interviewer  and  the  other  track  was  used  to  record  the  speaker.  The 
speaker  was  recorded  with  a  B  and  K  half-inch  condenser  microphone  and 
amplifier system in an IAC sound room equipped with a window. The
interviewer was recorded with a conventional dynamic microphone outside of
the  sound  room.  Two-way  communication  was  established  using  headphones. 

Each  session  began  with  the  speaker  reciting  his/her  name,  a  password, 
a word list and the first paragraph of the rainbow passage (9). The
interviewer  posed  a  topic  to  the  speaker,  and  the  remaining  time  (generally 
twelve  to  thirteen  minutes)  was  devoted  to  an  extemporaneous  monolog  by  the 
speaker.  The  interviewer  responded  briefly  when  appropriate,  or  when  it 
was  necessary  to  ask  a  new  question  for  continuity. 

A wide range of topics was covered, from describing a job to
describing a frightening experience. Although one might argue that this
approach in some sense constrained the data, casual listening to the
recordings demonstrates that this is not the case. The topics generally
provided a springboard for the speaker's thoughts, and the speech was
usually conversational, fluent and quite varied. (With one subject, the
suggested topic was consistently replaced by a wide variety of topics.)

Several  observations  should  be  noted  which  may  be  of  considerable 
importance  in  practical  situations.  After  the  initial  recording  gain 
calibration  for  each  session,  no  further  gain  adjustments  were  made. 
Subjects  occasionally  became  bored  or  distracted,  and  either  lowered  their 
voice intensity or turned their heads away from the microphone.
Conversely, subjects occasionally became intense on a topic and nearly
"swallowed" the microphone, resulting in substantial low frequency waveform
variability due to breath bursts. Also, there was some stuttering, throat
clearing,  laughter,  giggling  and  poor  articulation. 

In  addition  to  these  conditions,  about  half  of  the  subjects  acquired 
various  degrees  of  colds  during  a  two  to  three  week  period.  All  of  these 
cases  were  recorded  in  the  normal  fashion,  and  no  hand  editing  or  deletion 
of  any  data  was  performed.  The  data  used  in  this  study  consisted  of  only 
the  extemporaneous  speech  material  from  the  speakers,  excluding  the  rainbow 
passage,  word  lists,  etc.  The  total  duration  of  the  data  base  is  17 
speakers  x  10  sessions/speaker  x  approximately  13  minutes/session,  or 
approximately  36.8  hours  of  data. 

Several  large  population  and  long  duration  data  bases  have  been 
reported  in  the  literature  (10,11).  These  were  all  text-dependent  studies 
with  short  names  or  phrases.  However,  even  the  total  duration  of  the  large 
data  base  used  by  Das  and  Mohn  is  only  one-tenth  the  total  duration  of  the 
data base used in this study. The magnitude of this data base was
extremely valuable for choosing feature subsets and defining reference sets
which spanned varying periods of time.

Each  audio  tape  was  manually  cued  to  the  location  where  the 
extemporaneous  portion  of  the  interview  began.  Then  real-time  linear 
prediction  analysis  and  disk  storage  of  the  analysis  parameters  was 
initiated.  The  data  were  low  pass  filtered  at  3250  Hz  and  sampled  at  a 
6500  Hz  rate  for  compatibility  with  future  applications  to  telephone 
systems and narrowband vocoder systems. The speech samples were
preemphasized with a factor of 0.9, successive 128-point frames were
multiplied  by  a  Hamming  window,  and  the  autocorrelation  method  of  linear 
prediction  was  used  at  a  rate  of  fifty  frames  per  second.  The  analysis  was 
performed  in  real-time  under  Fortran  control  using  a  commercially  available 
array  processing  system  in  conjunction  with  a  PDP  11/45  computer  (4,12). 
The  analysis  parameters  for  each  speech  frame  were  ten  reflection 
coefficients,  pitch  period  (obtained  from  a  modified  cepstral  pitch 
tracker)  and  gain,  and  were  stored  in  a  quantized  format  of  eight  bits  (one 
byte) per parameter. The process was terminated when the end of the tape
was reached (defined as a thirty-second silence interval). The processing
of each interview resulted in an analysis file of approximately 1000 disk
blocks (512 bytes/block), and all interviews together required nearly half
the total space of a 200-Mbyte disk (340,670 formatted disk blocks). In
comparison, it would require ten 200-Mbyte disks to digitize all of the
interviews with 12 bits/sample and to store them directly without preprocessing.
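
The spectral part of the analysis described above can be sketched as follows. This is a minimal illustration that assumes NumPy, not the authors' Fortran and array-processor implementation: the function names, the Levinson-Durbin formulation, and the sign convention of the reflection coefficients are assumptions, and the cepstral pitch tracker, gain computation, and 8-bit quantization are omitted.

```python
import numpy as np


def reflection_coefficients(frame, order=10, preemph=0.9):
    """Reflection coefficients of one frame via the autocorrelation method.

    Steps: first-order preemphasis, Hamming window, autocorrelation,
    Levinson-Durbin recursion. Sign conventions differ between texts;
    here the predictor error filter is A(z) = 1 + sum_j a_j z^-j.
    """
    x = np.asarray(frame, dtype=float)
    # Preemphasis: y[n] = x[n] - preemph * x[n-1]; the first sample is kept as is.
    y = np.append(x[0], x[1:] - preemph * x[:-1])
    y = y * np.hamming(len(y))
    # Autocorrelation for lags 0..order.
    r = np.array([np.dot(y[: len(y) - lag], y[lag:]) for lag in range(order + 1)])
    k = np.zeros(order)
    if r[0] <= 0.0:
        return k  # all-zero frame: nothing to model
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / err
        a_prev = a.copy()
        for j in range(1, i):
            a[j] = a_prev[j] + k[i - 1] * a_prev[i - j]
        a[i] = k[i - 1]
        err *= 1.0 - k[i - 1] ** 2
    return k


def analyze(signal, fs=6500, frame_len=128, frames_per_second=50):
    """Return an (n_frames, 10) array, one row of reflection coefficients per frame."""
    hop = fs // frames_per_second  # 130 samples between frame starts at 6500 Hz
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.array([reflection_coefficients(signal[s:s + frame_len]) for s in starts])
```

With a 6500 Hz sampling rate and fifty frames per second, frame starts are 130 samples apart, so 128-point frames leave a two-sample gap between successive windows; whether the original system handled the framing this way is not stated.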


Next, the analysis files were used to obtain long-term feature
vectors, where each vector was the average of successive voiced analysis
frames. Unvoiced and silence frames were not included in this study, since
it was felt that fundamental frequency was an essential speaker-dependent
parameter. The vocoder analysis parameters consisted of fundamental
frequency (the reciprocal of the pitch period), gain and ten reflection
coefficients. For every interval Lv, long-term features based upon the
mean, standard deviation and dispersion (standard deviation divided by
mean) of the twelve parameters were computed, resulting in
thirty-six-dimensional feature vectors. This feature set was defined in a reasonably
general  manner  since  analytic  techniques  for  feature  reduction  may  be  used 
to  find  the  most  reasonable  feature  subsets  for  speaker  recognition. 
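
The construction of the thirty-six-dimensional vectors can be sketched as below, assuming NumPy and an input array holding only voiced frames with twelve columns (fundamental frequency, gain, and ten reflection coefficients). The function name, the column order, the use of the population standard deviation, and the discarding of leftover frames are all assumptions made for illustration.

```python
import numpy as np


def long_term_features(voiced_frames, lv):
    """Build long-term feature vectors from consecutive blocks of lv voiced frames.

    voiced_frames: array of shape (n_voiced, 12).
    Returns an array of shape (n_voiced // lv, 36) holding, per block,
    the 12 means, the 12 standard deviations, and the 12 dispersions (std / mean).
    """
    frames = np.asarray(voiced_frames, dtype=float)
    n_vectors = frames.shape[0] // lv
    # Frames left over after the last complete block are dropped (assumption).
    blocks = frames[: n_vectors * lv].reshape(n_vectors, lv, frames.shape[1])
    mean = blocks.mean(axis=1)
    std = blocks.std(axis=1)
    # Dispersion is undefined when a mean is zero; suppress the warning and
    # let NumPy return inf or nan for those entries.
    with np.errstate(divide="ignore", invalid="ignore"):
        dispersion = std / mean
    return np.hstack([mean, std, dispersion])
```

Because reflection coefficients can have means close to zero, the dispersion entries for those parameters may be numerically unstable in a sketch like this one.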

A  summary  of  the  number  of  feature  vectors  produced  for  all  170 
interviews  is  given  in  Table  1.  In  this  table,  the  data  are  partitioned 
into representative test and reference sets (13). Four choices of Lv were
studied, namely Lv = 30, 100, 300 and 1000. The total number of feature
vectors and the average real-time interval per feature vector as functions
of Lv are also given.

TABLE  1  GOES  HERE 


It  is  important  to  consider  the  relationship  between  a  particular 
value  of  Lv  and  the  real-time  interval  of  a  long-term  feature  vector.  Most 
significantly,  a  fixed  number  of  voiced  frames,  rather  than  all  of  the 
voiced frames from a fixed elapsed-time interval, was chosen for analysis.
With extemporaneous speech, there may be intervals of ten to twenty seconds
where  very  little  or  no  voiced  speech  occurs  (the  speaker  may  pause,  cough, 
laugh,  etc.),  leading  to  a  variable  voiced  frame  rate.  If  long-term 
features  were  a  function  of  the  voiced  frame  rate,  then  such  features  would 
not  be  reflective  of  only  a  speaker's  speech  sounds,  but  also  his/her 
speech  rate  and  style.  While  these  additional  characteristics  might  be  a 
source  of  speaker-dependent  information,  they  were  not  considered  in  this 
study,  and  consequently  long-term  features  were  made  independent  of  the 
voiced  frame  rate. 

The  real-time  interval  for  a  long-term  feature  (seconds/feature) 
corresponds  to  a  product  of  the  following  factors:  1)  the  number  of  voiced 
frames per feature vector (Lv), 2) the reciprocal of the voiced frame to
total frame ratio (or the reciprocal of the voicing duty factor), and
3) the reciprocal of the number of analysis frames per second (or the
reciprocal of the frame rate). In a previous study (5), the voicing
threshold was set such that very smooth fundamental frequency (F0) contours
were observed on a real-time display system, and as a result, Lv = 1000
corresponded to approximately seventy seconds of real speech. For this
study, the voicing threshold was determined by synthesizing the speech
using the F0 contour obtained, and then selecting the threshold that
produced  the  subjectively  best  synthesis.  The  ear  appears  more  sensitive 
to  voiced  speech  segments  which  are  synthesized  as  unvoiced,  rather  than 
the  reverse,  i.e.  buzziness  is  typically  preferred  over  whispery  or  hoarse 
speech.  As  a  result,  more  voiced  decisions  were  made,  and  Lv  =  1000  in 
this  study  corresponded  to  approximately  thirty-nine  seconds  of  speech. 
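
The three-factor product described above can be checked with a few lines of arithmetic. The sketch below uses the fifty frames per second and the thirty-nine and seventy second figures quoted in the text; the function names are hypothetical, and the voicing duty factor is treated as a single constant, whereas the actual averages in Table 1 are measured values that may differ.

```python
def seconds_per_feature(lv, voicing_duty, frame_rate=50.0):
    """Real-time interval covered by one feature vector of lv voiced frames."""
    return lv * (1.0 / voicing_duty) * (1.0 / frame_rate)


def implied_duty(lv, seconds, frame_rate=50.0):
    """Voicing duty factor implied by lv voiced frames spanning `seconds`."""
    return lv / (seconds * frame_rate)


duty_now = implied_duty(1000, 39.0)   # this study: about 0.51
duty_prev = implied_duty(1000, 70.0)  # previous study (5): about 0.29

for lv in (30, 100, 300, 1000):
    print(lv, round(seconds_per_feature(lv, duty_now), 2))
```

Under this constant-duty assumption, the lower voicing threshold roughly doubles the fraction of frames declared voiced relative to the previous study, which is why the same Lv covers a much shorter stretch of speech.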


The  feature  vectors  for  each  interview  for  each  of  the  above  values  of 
Lv  required  approximately  301,  93,  33  and  13  disk  blocks  respectively,  and 
a  total  of  74,800  disk  blocks  were  required  to  store  the  feature  vectors 
for  the  various  conditions  for  the  170  interviews.  These  data  were  then 
further  processed  as  described  in  the  next  section. 
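
The storage figures quoted in this section can be cross-checked with simple arithmetic. The sketch below uses only numbers stated in the text, plus one labeled assumption (an average of thirteen minutes of analyzed speech per interview) for the rough per-interview estimate.

```python
BLOCK_BYTES = 512
INTERVIEWS = 170
DISK_BLOCKS = 340_670  # formatted blocks on one 200-Mbyte disk

# Feature-vector storage per interview for Lv = 30, 100, 300 and 1000.
blocks_per_interview = {30: 301, 100: 93, 300: 33, 1000: 13}
feature_blocks = INTERVIEWS * sum(blocks_per_interview.values())

# Analysis files: approximately 1000 blocks per interview.
analysis_blocks = INTERVIEWS * 1000
analysis_fraction = analysis_blocks / DISK_BLOCKS

# Rough size of one analysis file: 12 one-byte parameters per frame at
# 50 frames per second, assuming about 13 minutes of analyzed speech.
approx_blocks = 12 * 50 * 13 * 60 / BLOCK_BYTES

print(feature_blocks)
print(round(analysis_fraction, 3))
print(round(approx_blocks))
```

The first figure reproduces the 74,800 blocks stated above, and the second confirms that the analysis files occupy just under half of one disk. The third comes out somewhat below 1000 blocks, which is consistent with the text only if the analyzed portion of each tape runs a little longer than thirteen minutes.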

III.  Experiments  in  Parameter  Variability 

A.  Intra-Speaker  Variability 

In a previous study (5), the within-speaker (intra-speaker) variability
of the features for one male speaker was demonstrated to be a monotonically
decreasing function of Lv from Lv = 1 to Lv = 1000 for a single
fifteen-minute session. Using the data base in this study, it was possible to
study the intra-speaker variability for a larger number of male and female
speakers, and in addition, it was possible to study the intra-speaker
variability for cumulative sessions. If individual sessions are described
by S(i), i = 1, ..., 10, then cumulative sessions may be described by C(i),
i = 1, ..., 10, where C(i) = S(1) + S(2) + ... + S(i).

The standard deviations of the long-term averages of the fundamental
frequency and the first reflection coefficient, denoted as σ[<F0>] and
σ[<k1>] respectively, as measured over the cumulative sessions C(i) for one
male and one female speaker, are shown in Figure 1.
For both speakers and for each set of cumulative sessions, σ[<F0>]
decreases as Lv increases. This behavior demonstrates that over long
intervals, a speaker's average fundamental frequency is (probably) a good
estimator of a characteristic or "habitual" value, and for successive long
intervals, the deviation from the habitual value is small. For short
intervals, influences such as speech prosody may mask the habitual value,
and successive short intervals will deviate more widely from each other.
This concept of habitual fundamental frequency is paralleled by the concept
of habitual (perceived) pitch; the latter is used in speech therapy as a
measure of acoustic improvement during treatment of a functional or organic
voice disorder (14), and is an important factor in listener-based speaker
recognition. For both speakers and for each value of Lv, there is a trend
for σ[<F0>] to increase as more sessions are included (although there are
exceptions, e.g. for the female speaker, σ[<F0>] for C(1) is greater than
σ[<F0>] for C(2)). The dependence of σ[<F0>] on Lv can approximately be
described as proportional to $L_v^{-1/2}$, which agrees with the theoretical
relationship between the variance of the mean of a set of samples of a
stationary random process, e.g., the Lv samples of F0, and the variance of
the process (5). In absolute terms, the standard deviation of the long-term
fundamental frequency averages, over a time span of more than three months,
varies from 17-23 Hz at Lv = 30 to 4-8 Hz at Lv = 1000 for the male
speaker, and from 28-33 Hz at Lv = 30 to 8-11 Hz at Lv = 1000 for the
female speaker.
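
The inverse-square-root dependence can be illustrated with synthetic data. The sketch below assumes NumPy and draws independent Gaussian values as a stand-in for frame-level F0; the mean of 120 Hz, the standard deviation of 20 Hz, and the function name are hypothetical. Real F0 frames are correlated from frame to frame, so measured speech would be expected to follow this law only approximately.

```python
import numpy as np


def std_of_block_means(samples, lv):
    """Standard deviation of the means of consecutive blocks of lv samples."""
    n_blocks = len(samples) // lv
    blocks = np.asarray(samples[: n_blocks * lv]).reshape(n_blocks, lv)
    return blocks.mean(axis=1).std()


rng = np.random.default_rng(0)
# Synthetic, independent "F0" values (illustration only, not study data).
f0 = rng.normal(120.0, 20.0, size=300_000)

lvs = np.array([30, 100, 300, 1000])
sigmas = np.array([std_of_block_means(f0, lv) for lv in lvs])

# Slope of log(sigma) against log(Lv); independent samples give about -0.5.
slope = np.polyfit(np.log(lvs), np.log(sigmas), 1)[0]
print(slope)
```

For independent samples the standard deviation of a block mean is the process standard deviation divided by the square root of the block length, which is the relationship the fitted slope approximates.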

The behavior of σ[<k1>] as Lv increases mirrors the behavior of σ[<F0>] as
Lv increases. Since the value k1 is a monotonic function of the spectral
slope of a first-order linear prediction inverse filter for speech (5,15),
a parallel explanation in terms of "habitual spectral slope" may be
given, i.e., the longer the interval, the better the estimate of the
habitual spectral slope. However, as more sessions are included, the
behavior of σ[<k1>] differs from the behavior of σ[<F0>]. For a given Lv,
there is essentially no measurable increase in k1 variability as the time
period increases from one fifteen-minute session to a period of nearly
three months, with all ten sessions included. This trend is observed for
the other speakers and the other long-term reflection coefficient averages,
thus substantiating the presence of an "habitual spectral characteristic"
for each speaker. Since the reflection coefficients are used to describe
the vocal tract shape in an acoustic tube model (16), the result implies
that the physical characteristics of a subject's vocal tract show no
observable changes over at least several months.

Furui et al. (17-20) have examined speaker variability over intervals
from a few weeks to several years. Their studies dealt with the
variability of repeated word lists and short sentences. They found that
for increasing time intervals from about three weeks to three months,
spectral parameters such as reflection (PARCOR) or cepstral coefficients
showed increasing variation. In contrast, the standard deviations of the
reflection coefficients in this study show essentially no variation over
time. Perhaps the data of Furui et al. were too linguistically
constrained,  and  speakers  never  approached  their  habitual  spectral 
characteristic. 

In summary, intra-speaker variability based on averaged features
decreases monotonically as the averaging interval increases. Furthermore,
for a large averaging interval, intra-speaker feature variability is
relatively consistent over a time period of three months. The next aspect
of this study is a comparison which includes the inter-speaker information,
e.g. a feature-by-feature analysis which uses the values of each feature
from  all  subjects.  If  some  features  have  small  inter-speaker  variance 
compared  to  the  intra-speaker  variance,  then  those  features  will  not  be 
useful  for  speaker  recognition,  and  the  performance  of  a  classifier 
designed  to  recognize  speakers  from  these  features  may  be  poor. 


B.  Variance  Ratio  Analysis 


One  method  of  measuring  the  usefulness  of  a  feature  for  speaker 
recognition  is  the  F-ratio  or  variance  ratio  (also  referred  to  as  the 
generalized  Fisher  discriminant)  (7,10,19).  The  variance  ratio  of  a 
feature  is  the  quotient  of  the  inter-speaker  variance  and  the  intra-speaker 
variance  (11).  In  general,  the  larger  the  variance  ratio  for  a  particular 
feature,  the  greater  the  probable  contribution  of  the  feature  in 
distinguishing the speakers (13), but this property is strongly dependent
on  the  data  and  the  experimental  procedure.  However,  the  variance  ratio 
does  not  account  for  inter-feature  correlations,  and  if  two  features  with 
high  variance  ratios  are  highly  correlated,  then  the  inclusion  of  both 
parameters  might  be  somewhat  redundant  (7). 
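
A single-feature variance ratio of the kind described above can be sketched as follows, assuming NumPy. The function name and the normalization (population variances, unweighted average over speakers) are assumptions; the exact formula used in the study is given in the cited references rather than in this section.

```python
import numpy as np


def variance_ratio(feature_by_speaker):
    """Inter-speaker variance divided by average intra-speaker variance.

    feature_by_speaker: list of 1-D sequences, one per speaker, each holding
    that speaker's values of a single long-term feature.
    """
    speaker_means = np.array([np.mean(v) for v in feature_by_speaker])
    intra = np.mean([np.var(v) for v in feature_by_speaker])
    inter = np.var(speaker_means)
    return inter / intra


# Hypothetical values of one feature for two speakers (illustration only).
well_separated = [[1.0, 1.2, 0.8], [5.0, 5.2, 4.8]]
overlapping = [[1.0, 3.0, 5.0], [2.0, 4.0, 6.0]]
print(variance_ratio(well_separated))
print(variance_ratio(overlapping))
```

The ratio is large when speaker means are far apart relative to each speaker's own spread, and small when the speakers' values overlap. As the text notes, it is computed one feature at a time and therefore says nothing about correlation between features.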


1. Trends as a function of population

The variance ratios for the case Lv = 1000 and cumulative sessions 1-10
are shown in Figure 2 for the male and
female  speakers  separately,  and  in  Figure  3  for  two  subsets  of  the  male 
speakers.  Only  the  variance  ratios  of  the  mean  and  standard  deviation 
features  are  shown.  The  variance  ratios  of  the  dispersion  features  were 
consistently  low,  and  therefore  believed  to  contribute  very  little  toward 
speaker  recognition  in  this  study. 

There are noticeable differences in the variance ratios between the
male and female populations. Based on relative magnitudes, the features
< (kg)>, < (kg)> and <k^> would be the most significant for identifying the
male population, while < (k7)>, < (kg> and <kg> would be the most
significant for identifying the female population. If the male population
is arbitrarily divided into two equal-sized subsets, there are pronounced
changes in the variance ratios. For the first set of male speakers, <k^>,
<F0> and <k2> have the largest variance ratios, and for the second set of
male speakers, <k4>, <kg> and <k3> have the largest variance ratios. These
results  show  the  need  to  have  a  substantially  larger  speaker  population  in 
order  to  characterize  the  parameters  of  major  importance.  However,  it  is 
estimated  that  to  obtain  variance  ratios  which  would  exhibit  consistent 
trends  for  a  set  of  speakers  and  for  subsets  of  the  speakers,  a  much  larger 
data  base,  possibly  more  than  100  speakers,  would  be  required. 

In the previous paper (5), for a smaller and more homogeneous data
base, <k2> and <kg> were found to be the most significant parameters.
These large variance ratios would be physical evidence for the importance
of the first and third formants in voiced speech (5). This larger
population base, however, shows no such relationships. The conclusion is
that  for  studies  with  linguistically  unconstrained  speech,  parameter 
ranking  using  variance  ratios  should  be  used  cautiously.  The  parameters 
with  large  variance  ratios  may  change  depending  on  how  the  data  are 
partitioned,  and  the  features  with  small  variance  ratios  may  be  important 
for  achieving  good  speaker  recognition  if  the  data  partitioning  is  changed. 
(Conversely,  it  will  be  shown  that  some  parameters  with  small  variance 
ratios  may  actually  degrade  speaker  recognition.) 


2.  Trends  as  a  function  of  Lv  and  time-spacing 


The variance ratios were determined for the case Lv = 100 and
cumulative sessions 1-10 (Figure 4), and for the case Lv = 1000 and
cumulative sessions 1-2 (Figure 5). Comparing Figures 2 and 4, which only
differ by the averaging interval Lv, the variance ratios generally maintain
the same relative relationships, i.e. the features which have the
relatively larger variance ratios for Lv = 1000 also have the relatively
larger variance ratios for Lv = 100. However, the absolute values of the
variance ratios are smaller for Lv = 100 than for Lv = 1000. Comparing
Figures 2 and 5, which only differ by the number of sessions, the relative
relationships and the absolute values of the variance ratios are similar
for two cumulative sessions and for ten cumulative sessions. However, a
slight decrease in the absolute values of the variance ratios for ten
cumulative sessions is observed. If the inter-speaker variance is assumed
relatively constant for two or ten cumulative sessions, then the slight
decrease in variance ratios for ten sessions over two sessions correlates
with the slight increase in standard deviations observed in Figure 1. This
result further establishes that a speaker's habitual features, when
measured over a relatively long interval (greater than thirty seconds), do
not show appreciable changes over time periods up to three months.


3.  Further  observations 


It  is  also  evident  that  the  variance  ratios  for  the  mean  features 
generally  have  larger  values  than  the  corresponding  variance  ratios  for  the 
standard  deviation  features.  The  variance  ratios  for  the  dispersion 
features  are  in  turn  substantially  lower  in  value  than  the  corresponding 
variance  ratios  for  the  standard  deviation  features.  Features  based  upon 
gain  have  consistently  small  variance  ratios. 
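
The between-to-within speaker variance ratio discussed in this section can
be illustrated with a short sketch. The exact normalization used in the
study is not restated here, so the definition below (variance of the
speaker means divided by the average within-speaker variance) is an
assumption, and the function name is hypothetical.

```python
import numpy as np

def variance_ratio(values_by_speaker):
    """Between-to-within speaker variance ratio for one scalar feature.

    values_by_speaker: list of 1-D sequences, one per speaker, each holding
    that speaker's long-term feature values. The normalization is an
    assumption made for illustration, not the study's stated formula.
    """
    speaker_means = np.array([np.mean(v) for v in values_by_speaker])
    within = np.mean([np.var(v, ddof=1) for v in values_by_speaker])
    between = np.var(speaker_means, ddof=1)
    return between / within

# Three toy speakers whose means are well separated relative to their spread.
toy = [[0.9, 1.0, 1.1], [1.9, 2.0, 2.1], [2.9, 3.0, 3.1]]
ratio = variance_ratio(toy)
```

A feature whose speaker means are far apart relative to the spread within
each speaker yields a large ratio; a feature such as gain, which varies as
much within a speaker as between speakers, yields a small one.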


IV.  Speaker  Recognition 


Speaker recognition was based on a weighted Euclidean distance metric
(5,7,11), where the mean vector and inverse covariance matrix for each of
the seventeen speakers were estimated from feature vectors in the specified
reference set. All thirty-six dimensions were used initially. The
distances between each reference class and each test vector were computed,
and the test vector was assigned to the reference class which yielded the
smallest distance. For speaker identification, a tally was taken of the
number of correct choices. For speaker verification, the distances were
stored for further analysis with a variable distance threshold. The method
of cross-validation in both directions was used (11), where independent
subsets of the data were cyclically treated as test and reference groups,
and the speaker recognition scores for each cycle were averaged for the
final scores.
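
The minimum-distance identification rule described above can be sketched
as follows. This is an illustration under stated assumptions, not the
study's implementation: the function names and the dictionary layout are
hypothetical, and each speaker's covariance matrix is assumed to be
invertible.

```python
import numpy as np

def train_references(features_by_speaker):
    """Estimate a mean vector and inverse covariance matrix per speaker.

    features_by_speaker: dict mapping a speaker label to an array of shape
    (n_vectors, n_dims). Assumes enough vectors for a nonsingular covariance.
    """
    references = {}
    for speaker, vectors in features_by_speaker.items():
        vectors = np.asarray(vectors, dtype=float)
        mean = vectors.mean(axis=0)
        inv_cov = np.linalg.inv(np.cov(vectors, rowvar=False))
        references[speaker] = (mean, inv_cov)
    return references

def distance(x, mean, inv_cov):
    """Weighted Euclidean distance (x - mean) M (x - mean)^T."""
    diff = np.asarray(x, dtype=float) - mean
    return float(diff @ inv_cov @ diff)

def identify(x, references):
    """Assign x to the reference class with the smallest distance."""
    return min(references, key=lambda spk: distance(x, *references[spk]))
```

For verification, the same `distance` value would be compared against a
threshold instead of being minimized over speakers.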

Atal (7) and Bricker et al. (10) discussed three possible choices for
a distance metric. Each metric was a positive semidefinite form which
could be described by $d_i = (X - Y_i) M (X - Y_i)^T$, where $X$ was a
vector to be classified, $Y_i$ was the mean vector for class $i$, and $M$
was a weighting matrix. The choices for $M$ were a pooled inverse
covariance matrix $W^{-1}$ from all speakers, an individual inverse
covariance matrix $W_i^{-1}$ from each speaker, or a discriminant matrix
$D$ composed of the eigenvectors of $W^{-1}B$, where $B$ was the
between-class covariance matrix.

The use of the discriminant matrix D requires sufficient knowledge of
the inter-speaker variability, which may be difficult to attain unless an
extremely large number of speakers is used. Atal (7) and Bricker
et al. (10) preferred the pooled covariance matrix W over the individual
covariance matrix W_i. Their rationale was that data limitations (fewer
samples than dimensions) frequently result in a singular (noninvertible)
covariance matrix, and that one pooled covariance matrix would adequately
represent all speakers, even though speaker-dependent data is contained in
individual covariance matrices and subsequently is not used.
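
The singularity problem mentioned above follows from a general property:
a sample covariance matrix estimated from n vectors has rank at most
n - 1, so it cannot be inverted when n does not exceed the number of
dimensions. The sketch below illustrates this with synthetic Gaussian data
(an assumption; it does not use the study's features), and the function
name is hypothetical.

```python
import numpy as np

def covariance_rank(n_samples, n_dims, seed=0):
    """Rank of a sample covariance matrix built from synthetic data."""
    rng = np.random.default_rng(seed)
    vectors = rng.standard_normal((n_samples, n_dims))
    return int(np.linalg.matrix_rank(np.cov(vectors, rowvar=False)))

# With 36 dimensions, 20 vectors give a rank-deficient (singular) matrix,
# while 100 vectors give a full-rank one.
rank_small = covariance_rank(20, 36)
rank_large = covariance_rank(100, 36)
```

Pooling more sessions per speaker, or pooling across speakers, raises the
number of vectors and so avoids the rank deficiency.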

From Table 1, the average number of feature vectors per speaker per
session for Lv = 30, 100, 300, 1000 is 685 (116411/170), 205 (34862/170),
68 (11563/170) and 20 (3413/170) respectively. For Lv = 30, 100 or 300,
with thirty-six dimensions, the individual covariance matrices were never
singular for any number of pooled sessions. For Lv = 1000, with thirty-six
dimensions, the individual covariance matrices were singular if fewer than
three sessions were pooled. Furthermore, Kanal (14) has suggested that ten
times the number of dimensions is an adequate sample size for good
covariance matrix estimates with normal probability distribution
assumptions. For five sessions and thirty-six dimensions in a reference
class, the factors which relate sample size to dimensionality for
Lv = 30, 100, 300, 1000 are 95 (685*5/36), 28 (205*5/36), 9 (68*5/36) and
3 (20*5/36) respectively. For Lv = 1000, sessions as long as forty-five
minutes would have been necessary to produce a factor near ten, but a
factor as large as ten is probably not needed for features which are
themselves the average of 1000 frames of data. However, fifteen minutes
was a sufficient duration for the other values of Lv as well as an upper
limit of endurance for the subject and interviewer. It was felt that the
advantages gained through the use of individual covariance matrices
outweighed potential problems of undersampling the speaker's statistics.
In a practical situation, relatively long sessions would be necessary for
sufficient accumulation of a speaker's reference data, but thereafter the
speaker could be verified approximately every thirty-nine seconds.
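
The sample-size arithmetic in the preceding paragraph can be reproduced
directly. The totals are the Table 1 counts quoted in the text; the
divisor 170 is taken to be seventeen speakers times ten sessions, and the
variable and function names are hypothetical.

```python
# Total feature vectors per Lv condition, as quoted from Table 1.
TOTAL_VECTORS = {30: 116411, 100: 34862, 300: 11563, 1000: 3413}

SPEAKER_SESSIONS = 17 * 10   # seventeen speakers, ten sessions each
DIMENSIONS = 36              # full feature set
REFERENCE_SESSIONS = 5       # sessions pooled into one reference class

def vectors_per_session(lv):
    """Average number of feature vectors per speaker per session."""
    return round(TOTAL_VECTORS[lv] / SPEAKER_SESSIONS)

def sample_to_dimension_factor(lv):
    """Reference-class sample size divided by the number of dimensions."""
    return vectors_per_session(lv) * REFERENCE_SESSIONS / DIMENSIONS

factors = {lv: round(sample_to_dimension_factor(lv)) for lv in TOTAL_VECTORS}
```

Only Lv = 30 and Lv = 100 exceed the factor of ten suggested by Kanal;
Lv = 300 falls just short and Lv = 1000 falls well short.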


A.  Trends  as  a  function  of  Lv 


For the first series of tests, the first five sessions were treated as
the reference data, the second five sessions were treated as the test data,
and then vice-versa. Results are shown in Table 2. In Table 2a, it is
seen that the average scores for the probability of correct identification
P(CI) monotonically increase from 60% to nearly 92% as Lv increases from 30
to 1000 respectively. A confusion matrix of identification errors shows
that no one speaker is more difficult to identify than any other speaker.
In Table 2b, as Lv increases, the speaker verification equal error
probability (probability of false acceptance P(FA) equals the probability
of false rejection P(FR)) monotonically decreases from 43.1% to 8.8%. This
trend is principally due to the P(FA) behavior, since the P(FR) behavior
does not change appreciably with Lv (5). Although the distance threshold
for a given probability of correct acceptance and fixed dimensionality
(under multivariate normal assumptions) may be analytically obtained, the
distance threshold for the equal error probability can only be determined
experimentally. In Table 2c, the equal error probability distance
threshold is seen to monotonically increase as Lv increases.
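
Since the equal error threshold has to be found experimentally, one way
to do so is to sweep candidate thresholds over the observed distances and
pick the one where P(FA) and P(FR) are closest. The sketch below is one
such procedure under that assumption; it is not necessarily the search
used in the study, and the function name is hypothetical.

```python
import numpy as np

def equal_error_point(intra, inter):
    """Return (threshold, error rate) where P(FA) is closest to P(FR).

    intra: distances between test vectors and their correct speaker.
    inter: distances between test vectors and incorrect speakers.
    A claim is accepted when its distance is <= threshold.
    """
    intra = np.asarray(intra, dtype=float)
    inter = np.asarray(inter, dtype=float)
    thresholds = np.unique(np.concatenate([intra, inter]))
    # False rejection: correct speaker falls above the threshold.
    p_fr = np.array([(intra > t).mean() for t in thresholds])
    # False acceptance: impostor falls at or below the threshold.
    p_fa = np.array([(inter <= t).mean() for t in thresholds])
    best = int(np.argmin(np.abs(p_fa - p_fr)))
    return float(thresholds[best]), float((p_fa[best] + p_fr[best]) / 2)

threshold, eer = equal_error_point([1, 2, 3, 4], [3.5, 5, 6, 7])
```

With finite data the two error rates rarely coincide exactly, so the
sketch reports their average at the closest threshold.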


TABLE  2  GOES  HERE 


It is interesting to illustrate the difference between text-independent
speaker recognition with and without linguistic constraints.
Sambur has proposed an orthogonal linear prediction set of parameters for
text-independent speaker recognition (4). Within the context of a
linguistically constrained experiment where all speakers spoke the same set
of sentences, Sambur's text-independent results (in the sense that the
reference sentences were different from the test sentences) were near 94%.
The orthogonal linear prediction parameters are essentially equivalent to a
linear transformation of the long-term reflection coefficient averages
used in this study if Lv = 1 (equivalent to no averaging). If all
linguistic constraints are removed, and if little or no averaging is used,
the results of Table 2 indicate that the speaker identification scores for
a true text-independent situation with a reasonable number of speakers will
be quite poor (even for Lv = 30, P(CI) is bounded from above at 62%). A
similar statement follows for the case of speaker verification.


B.  Trends  as  a  function  of  time  spacing 


Rosenberg (21) has noted that one of the most important considerations
in designing a data base is the time period over which utterances are
collected and the methods for establishing reference patterns over time.
Following the pictorial scheme of Furui et al. (14-17) for illustrating
reference and test sets over time, speaker recognition for the four cases
shown in Figure 6 was investigated. Reference sets were composed of from
two to five successive sessions (with a time interval of at least one week
between sessions). No commingling such as odd-numbered reference sessions
and even-numbered test sessions was allowed. For each case, the reference
and test sets were composed of equal numbers of successive independent
sessions, and two-direction recognition tests (as described above) were
made for the four Lv cases.
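
The four reference/test arrangements can be enumerated with a short
sketch. It assumes, consistent with the session ranges listed in Table 3,
that a reference set of n successive sessions is paired with the next n
sessions and that the roles are then swapped; the function name is
hypothetical.

```python
def two_direction_splits(n_sessions):
    """Reference/test session lists for n successive sessions, both ways."""
    first = list(range(1, n_sessions + 1))
    second = list(range(n_sessions + 1, 2 * n_sessions + 1))
    # Each block serves once as reference and once as test; no interleaving.
    return [(first, second), (second, first)]

# Reference sets of two to five sessions, as investigated in the text.
all_splits = {n: two_direction_splits(n) for n in range(2, 6)}
```

The scores from the two directions of each split are averaged to give the
final score for that case.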

The  results  are  presented  in  Table  3.  It  is  seen  that  for  all  Lv 
conditions,  higher  scores  were  obtained  as  the  number  of  cumulative 
sessions  increased. 


TABLE  3  GOES  HERE 


The difference in the speaker identification score between the first
two sessions and the first five sessions is around 15% for all cases
shown (Lv = 1000 was not used for two sessions since the covariance
matrices were singular). It is interesting to note that in a
text-dependent speaker verification experiment with different parameters
and approaches, Luck (22) found that speech samples collected over a five
week period gave the best results.


C.  Trends  as  a  function  of  feature  subsets 


In a previous section, it was noted that the dispersion features had very
small variance ratios, whereas the mean features as a group consistently
had the largest variance ratios. How would recognition scores compare if
the dispersion features were omitted, or if only the mean features were
included? The recognition test for Lv = 1000 and five sessions per
reference and test set was repeated using several different feature
subsets, based on an analysis of the magnitudes of the variance ratios. In
one case, only the twelve mean features were used, and in a second case,
only the twenty-four mean and standard deviation features were used. The
average scores for the two cases were P(CI) = 93.6% with
P(FA) = P(FR) = 14.5%, and P(CI) = 96.8% with P(FA) = P(FR) = 7.2%
respectively. For comparison, the corresponding average scores for all
thirty-six features (Table 2) were P(CI) = 91.6% with P(FA) = P(FR) = 8.8%.

Not only did both of these new cases based on feature subsets yield
better scores than the original thirty-six dimension feature set, but in
the second case, the identification score was markedly increased by more
than 5%. This result is a significant practical illustration that the
inclusion of some parameters which would hopefully improve performance (or
at worst would have no effect on performance) can sometimes actually
degrade the system performance in an open test. In a closed test with the
distance metric used in this study, where a reference set also is used as a
test set, this theoretically cannot happen. Closed tests on this data base
verified that monotonic increases in the number of features produced
monotonic increases in the P(CI) and monotonic decreases in equal error
probability P(FA) = P(FR).

This improved performance by eliminating features with relatively
small variance ratios was the basis for one additional test with a feature
subset. In considering the remaining twenty-four features, the
gain-related features had very small variance ratios, and furthermore, the
inclusion of gain-related features was difficult to physically justify. In
fact, it could be argued that even if these features helped, they should
not be included because they may simply reflect a speaker's position,
interest, etc. during the recording session. Therefore, the recognition
test with only twenty-four features was repeated with the gain-related
features removed, and the performance of this last test with only
twenty-two parameters was better than any previous test. The final results
of this study using only the twenty-two fundamental frequency and
reflection coefficient long-term averages are shown in Table 4.
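
The feature-subset bookkeeping above (36, then 12, 24 and finally 22
features) can be made explicit with a small sketch. It assumes twelve
parameters per frame (fundamental frequency, gain and ten reflection
coefficients) and three long-term statistics per parameter; the count of
ten reflection coefficients is inferred from the totals quoted in the
text, and all names are hypothetical.

```python
# Twelve per-frame parameters (assumed: F0, gain, ten reflection coefficients).
PARAMETERS = ["F0", "gain"] + [f"k{i}" for i in range(1, 11)]
# Three long-term statistics computed for each parameter.
STATISTICS = ["mean", "std", "dispersion"]

ALL_FEATURES = [(stat, par) for stat in STATISTICS for par in PARAMETERS]

def select(features, statistics, drop_parameters=()):
    """Keep features with the given statistics, minus dropped parameters."""
    return [(s, p) for (s, p) in features
            if s in statistics and p not in drop_parameters]

mean_only = select(ALL_FEATURES, {"mean"})
mean_and_std = select(ALL_FEATURES, {"mean", "std"})
final_set = select(ALL_FEATURES, {"mean", "std"}, drop_parameters={"gain"})
```

The positions of the selected pairs within `ALL_FEATURES` could then serve
as column indices into a matrix of thirty-six-dimensional feature vectors.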


TABLE  4  GOES  HERE 


These results are extremely promising for future studies in many areas of
speaker recognition. This substantial testing effort (over eighty
million distance measurements) has shown that realistic and acceptable
speaker identification and speaker verification can be achieved with
text-independent linguistically unconstrained speech.


The cumulative probability functions (Figure 7A) and the probability
density functions (Figure 7B) for false rejection and false acceptance may
be used to compare the inter- and intra-speaker distances in the
verification task. These curves are derived from the first half of the
final speaker verification test with 22 features, Lv = 1000, reference
sessions 1-5 and test sessions 6-10. The equal error point is graphically
depicted as the crossover point of the two cumulative probability curves in
Figure 7A. This equal error point is found at a distance threshold where
the probability of false acceptance (i.e. acceptance of an impostor) is
equal to the probability of false rejection (i.e. rejection of a correct
speaker).

The probability density functions (pdfs) in Figure 7B show the
distribution of the intra- and inter-speaker distances. The crossover
point in Figure 7A divides each of the pdfs into two sections, with the
area under the intra-speaker pdf to the right of the dividing line equal to
the area under the inter-speaker pdf to the left of the dividing line. For
this data, the equal error crossover point is close to the intersection of
the two pdfs, but only identical and symmetric pdfs will always have
identical crossover and intersection points.

For test sessions 6-10 with Lv = 1000, there were a total of 1708 test
vectors from the 17 speakers. The distances between each of these test
vectors and the correct reference speaker comprise the intra-speaker
distance space. A histogram of these intra-speaker distances is shown in
Figure 8A. The mean and standard deviation of the histogram distances were
used to approximate normal and log-normal distributions. For the open
test, there is no underlying theoretical distribution, and a chi-square
test was used to measure the goodness of fit of the normal and log-normal
distributions. The log-normal distribution had the smallest chi-square
measure. Analogously, the distances between each of the 1708 test vectors
and each of the 16 incorrect speakers (i.e. eliminating the reference
speaker who is a correct match to the test vector) comprise the
inter-speaker distance space. A histogram of the 27,318 inter-speaker
distances is shown in Figure 8B. A log-normal distribution is a better fit
to the inter-speaker histogram than a normal distribution, but not as good
a fit as with the intra-speaker histogram.
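
The comparison of normal and log-normal fits can be sketched as below.
Both models are matched to the sample mean and standard deviation, as the
text describes. The binning is an assumption (equal-count bins taken from
sample quantiles), since the histogram bins used in the study are not
given, and the function names are hypothetical.

```python
import math
import numpy as np

def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def chi_square_fits(distances, n_bins=20):
    """Chi-square measure of normal and log-normal fits to positive data."""
    d = np.asarray(distances, dtype=float)
    m, s = d.mean(), d.std(ddof=1)
    # Log-normal parameters that reproduce the sample mean and std.
    sigma = math.sqrt(math.log(1.0 + (s / m) ** 2))
    mu = math.log(m) - 0.5 * sigma ** 2
    # Interior bin edges at equally spaced sample quantiles (assumption).
    edges = np.quantile(d, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    observed = np.bincount(np.searchsorted(edges, d, side="right"),
                           minlength=n_bins)

    def chi2(cdf_at_edges):
        probs = np.diff(np.concatenate(([0.0], cdf_at_edges, [1.0])))
        expected = d.size * probs
        return float(np.sum((observed - expected) ** 2 / expected))

    normal = [normal_cdf((e - m) / s) for e in edges]
    lognormal = [normal_cdf((math.log(e) - mu) / sigma) for e in edges]
    return {"normal": chi2(normal), "lognormal": chi2(lognormal)}
```

The model with the smaller chi-square measure is the better fit; for
right-skewed distance data the log-normal model would be expected to win.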


V.  Summary 


The significance and value of long-term feature averaging for
text-independent speaker recognition with linguistically unconstrained
speech have been demonstrated. This study used practical analysis
conditions of telephone-range spectral width (0-3250 Hz) and parameters
obtained from a linear prediction vocoder. All parameter-related
computations were performed in real time using 16-bit integer arithmetic,
and all parameters were further quantized into an 8-bit format for
efficient disk storage.

The recording environment was controlled by recording the speakers
with a condenser microphone in an IAC sound room. An important extension
of this work would be to reprocess the "clean-text" audio tapes through
various channel disturbances such as the telephone system to determine the
robustness of the approach in less ideal environmental conditions (20).
Also, in some situations, reference data may be obtained in a clean
environment and subsequent speaker recognition attempted in a noisy
environment. This area also requires investigation.

Although  seventeen  speakers  is  not  a  trivial  population  size,  it 
appears  that  for  determining  the  importance  of  individual  features  for 
speaker  recognition  using  linguistically  unconstrained  text,  a 
substantially  larger  population  base  is  required.  It  was  found  that 
features  obtained  from  only  one  or  two  sessions  of  a  given  population  are 
relatively  unchanged  over  a  much  larger  number  of  time-spaced  sessions, 
where  there  was  at  least  one  week  between  sessions.  Other  features  should 
also  be  investigated.  It  has  been  suggested  that  mean  deviations  (21)  may 
prove  more  useful  than  the  standard  deviations  used  in  this  study.  Further 
research  is  also  required  to  assess  the  conditions,  e.g.,  the  number  of 
long-term  samples  from  a  speaker,  for  obtaining  a  good  estimate  of  the  mean 
and  variance  of  a  speaker's  characteristics. 

An assumption throughout has been that only voiced speech frames are
to be used in the analysis. If this assumption were not necessary, or if
only slight degradation occurred when both voiced and unvoiced speech
frames were included, the process would be simplified computationally, and
in addition, 1000 frames per average would correspond to a real time
interval only about half as long as required here.

The best speaker recognition was obtained when 1) five sessions
successively separated by at least one week were used to define the
reference set, 2) the mean and standard deviation of the long-term averages
of the fundamental frequency and reflection coefficients were used, and
3) each feature was obtained by averaging 1000 voiced analysis frames
(corresponding to average real-time intervals of about thirty-nine
seconds). With approximately eighteen hours of reference data and eighteen
hours of independent test data from seventeen speakers, spaced over nearly
three months in time, an average speaker identification score of 98.05% and
an average equal error speaker verification rate of 4.25% were measured.


FIGURES


Fig. 1  Standard deviation of long-term features as a function of Lv,
        the number of voiced frames per feature vector.

Fig. 2  Variance ratios from all 10 sessions as a function of long-term
        mean and standard deviations of parameters, Lv=1000:
        A) all male speakers
        B) all female speakers

Fig. 3  Same conditions as Fig. 2 except
        A) male speakers: first five
        B) male speakers: second five

Fig. 4  Same conditions as Fig. 2 except that Lv=100:
        A) all male speakers
        B) all female speakers

Fig. 5  Same conditions as Fig. 2 except only sessions 1-2 shown:
        A) all male speakers
        B) all female speakers

Fig. 6  Relations between reference samples and test samples for
        experimental results of Table 3.

Fig. 7  Intra- and Inter-speaker Comparisons
        A) Cumulative Probability
        B) Probability Density Estimates

Fig. 8  Distance Histograms and Models
        A) Intra-speaker Distances
        B) Inter-speaker Distances


TABLES

Table 1  Number of feature vectors and average real-time interval (RTI)
         for each Lv condition.

Table 2  Speaker recognition based on partitioning data in half and with
         36 long-term features.

Table 3  Percent of speakers correctly identified as a function of the
         number of reference sessions.

Table 4  Performance with fundamental frequency and reflection
         coefficient mean and standard deviation long-term features,
         Lv=1000 (average real-time interval = 39 seconds).

Table  1.  Number  of  feature  vectors  and  average  real-time  interval  (RTI) 


(a) SPEAKER IDENTIFICATION
    Percent of correct choices based on minimum distance

      SESSION                      Lv
    REF.    TEST       30      100      300     1000
    1-5     6-10     61.20    78.65    88.20    93.34
    6-10    1-5      59.87    75.48    85.27    89.77
    AVERAGE          60.54    77.06    86.74    91.56

(b) SPEAKER VERIFICATION
    Percent of false acceptances and false rejections based on equal
    error criterion

      SESSION                      Lv
    REF.    TEST       30      100      300     1000
    1-5     6-10     43.4     27.8     10.7      9.4
    6-10    1-5      42.8     26.9     10.5      8.2
    AVERAGE          43.1     27.4     10.6      8.8

(c) SPEAKER VERIFICATION
    Threshold distance based on equal error criterion

      SESSION                      Lv
    REF.    TEST       30      100      300     1000
    1-5     6-10      5.79     7.52     9.78    18.84
    6-10    1-5       5.85     7.58    10.85    21.10
    AVERAGE           5.82     7.55    10.32    19.97

Table 2. Speaker recognition based on partitioning data in half and with
36 long-term features


         SESSIONS                      Lv
    NO.   REF.    TEST       30      100      300     1000
     2    1-2     3-4      50.36    64.34    71.18     --
     2    3-4     1-2      53.45    67.95    75.31     --
     3    1-3     4-6      54.29    70.03    79.12    80.58
     3    4-6     1-3      57.04    72.69    82.14    89.30
     4    1-4     5-8      59.91    76.41    86.73    92.85
     4    5-8     1-4      59.26    74.62    83.45    86.34
     5    1-5     6-10     61.20    78.65    88.20    93.34
     5    6-10    1-5      59.87    75.48    85.27    89.77

Table 3. Percent of speakers correctly identified as a function of the
number of reference sessions


FINAL RESULTS OF 2-WAY TESTING ON 38 HOURS OF EXTEMPORANEOUS SPEECH


Table 4. Performance with fundamental frequency and reflection coefficient
mean and standard deviation long-term features, Lv = 1000 (average
real-time interval = 39 seconds).


